Internet of Things (IoT) devices such as smart speakers, TVs, and video doorbells are increasingly found in homes, providing useful functionality and convenience. Along with their benefits come potential risks, since these devices can communicate information about their users (audio recordings, television viewing habits, video recordings, etc.) to other parties over the Internet. To help understand and mitigate these risks, in the Mon(IoT)r Lab at Northeastern University we are developing tools for measuring the behavior of IoT devices at scale, such as their network traffic and the visual signals they emit.
As part of the Digital Lab fellowship at Consumer Reports, we are focusing on understanding and mitigating privacy risks for smart speakers, based on the observation that these devices have the capability to record and/or transmit audio from their environments at any time.
In a previous study, we showed, by playing a variety of audio material to smart speakers in a controlled manner, that different generations of smart speakers from the four leading manufacturers record and upload conversations when they should not (i.e., when their wake word has not been spoken). However, we did not show what factors increase this risk, whether the devices are improving over time, or whether there is anything users can do to reduce the risk of privacy exposure when using these devices.
In this fellowship, we reengineered the test infrastructure to be fully automated and more realistic (for example, by using multiple copies of the same devices), and we also expanded our audio material. It now includes hundreds of hours of voice samples from the Mozilla Common Voice dataset and additional hundreds of hours of dialogue from popular Netflix sitcoms. We have categorized each new unit of audio material by gender, age group, race, and accent. The question we want to answer with these new experiments is whether these devices show any bias depending on the category of audio material, thus revealing whether people from certain groups are more at risk of being recorded or misunderstood by a smart speaker.

Our results so far show that, compared to one year ago, smart speakers from all the manufacturers we tested still record when they should not, but much less often than before, which is good news. Moreover, we have not yet seen strong evidence of discriminatory bias, although some data is still being analyzed. We have also seen that the crystal-clear sentences from Mozilla Common Voice do not activate the devices as often as sitcom dialogue does. A likely explanation is that sitcom dialogue is often faster and less clear because of background sounds and emotional speech, which may confuse the smart speakers.
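As a concrete illustration of how the Common Voice material can be organized into such categories before playback, here is a minimal sketch that groups clips by the demographic fields in the dataset's metadata. The file path is a placeholder, and the exact column names ("age", "gender", "accent") can vary between dataset releases; this is not necessarily the pipeline we run, just the general idea.

```python
import csv
from collections import defaultdict

# Placeholder path; the actual corpus location and release differ per setup.
METADATA = "cv-corpus/validated.tsv"

clips_by_category = defaultdict(list)

with open(METADATA, newline="", encoding="utf-8") as f:
    for row in csv.DictReader(f, delimiter="\t"):
        # Keep only clips whose speaker provided demographic labels,
        # so that groups can be compared against each other.
        if not (row.get("age") and row.get("gender") and row.get("accent")):
            continue
        clips_by_category[(row["age"], row["gender"], row["accent"])].append(row["path"])

# Summarize how many clips fall into each (age, gender, accent) group.
for (age, gender, accent), clips in sorted(clips_by_category.items()):
    print(f"{age:>10} | {gender:>6} | {accent:<25} {len(clips)} clips")
```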
As part of our analysis of smart speakers, we have also analyzed their network traffic to see whether any of it was suspicious. The bad news is that we have seen traffic going to tracking and analytics services, meaning that smart speakers are doing more than just answering our questions when we intentionally or unintentionally trigger them. The good news is that we can block some of this traffic without affecting the functionality of the devices, thus reducing the opportunities for a device to expose user information. In the case of Amazon Echo devices, after an extensive number of controlled experiments, we were able to consistently block 8 of their 13 destinations, while for Google Home devices we were able to block 9 of 14. We have observed this phenomenon for IoT devices other than smart speakers as well: 28 other IoT devices in our testbed produce network traffic that is not needed for the device to function, and that traffic can therefore be blocked.
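To give a sense of how such blocking could work in practice, below is a minimal sketch for a Linux-based home router using iptables. The hostnames are placeholders rather than the actual destinations from our experiments, and a real deployment would also need to track DNS changes over time, since these services rotate IP addresses.

```python
import socket
import subprocess

# Placeholder hostnames, not the actual destinations identified in our experiments.
BLOCKLIST = [
    "tracker.example.com",
    "analytics.example.net",
]

def resolve(hostname):
    # Return all IPv4 addresses the hostname currently resolves to.
    try:
        return socket.gethostbyname_ex(hostname)[2]
    except socket.gaierror:
        return []

for host in BLOCKLIST:
    for ip in resolve(host):
        # Drop any traffic from the home network toward this destination.
        subprocess.run(["iptables", "-I", "FORWARD", "-d", ip, "-j", "DROP"], check=False)
        print(f"blocked {host} ({ip})")
```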
Another concerning behavior we have noticed is that some smart speakers communicate with each other when they are in proximity, even if they are owned by different users, set up on different accounts, and connected to different networks using different Internet connections. We observed this behavior when two unrelated HomePod devices were unable to answer “Hey Siri” at the same time, meaning that someone else's device can prevent your own device from responding.
Finally, one of our goals is to propose possible defenses against the privacy risk of smart speakers activating when they should not. To this end, we are experimenting with open-source wake word recognition engines to give users the ability to choose wake words that are harder to trigger accidentally. This would provide a more transparent and reliable way to trigger the functions of a smart speaker, thus mitigating the risk of accidentally recording conversations. We are also exploring ways to “mute” smart speakers by blocking their network traffic when the wake word is not detected, preventing any recording from being sent. A network-level mute can also be useful as an alternative to the physical mute button, since such buttons are typically implemented in software: if the device's software malfunctions due to a bug or is compromised, the microphone can remain active even when it appears to be disabled.
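As a rough sketch of how these two ideas could fit together, the example below combines a locally run wake word detector (the Porcupine engine is used here purely for illustration; it is one of several publicly available engines) with an iptables rule on a home router: the speaker's traffic is dropped by default and only allowed through for a short window after a local wake word detection. The speaker's IP address, the access key, and the chosen keyword are all placeholder assumptions, not values from our deployment.

```python
import subprocess
import time

import pvporcupine
from pvrecorder import PvRecorder

SPEAKER_IP = "192.168.1.50"   # placeholder: the smart speaker's address on the LAN
UNMUTE_SECONDS = 10           # how long to allow traffic after a detection

def set_blocked(blocked: bool) -> None:
    # Insert or remove a DROP rule for the speaker's outbound traffic (run on the router).
    action = "-I" if blocked else "-D"
    subprocess.run(["iptables", action, "FORWARD", "-s", SPEAKER_IP, "-j", "DROP"], check=False)

porcupine = pvporcupine.create(
    access_key="YOUR_ACCESS_KEY",  # placeholder credential for the engine
    keywords=["grasshopper"],      # an uncommon built-in keyword, harder to trigger by accident
)
recorder = PvRecorder(device_index=-1, frame_length=porcupine.frame_length)
recorder.start()
set_blocked(True)                  # start in the "network muted" state

try:
    while True:
        # process() returns the index of the detected keyword, or -1 for no detection.
        if porcupine.process(recorder.read()) >= 0:
            set_blocked(False)     # wake word heard: let traffic through briefly
            time.sleep(UNMUTE_SECONDS)
            set_blocked(True)
finally:
    recorder.stop()
    porcupine.delete()
```

The design choice here is that the recording device never decides for itself when it may talk to the Internet: the user's own router, listening for a wake word the user picked, makes that decision, so a software bug or compromise on the speaker cannot silently re-enable uploads.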